fix: avoid blocking unrelated sessions during pool initialization (#256) by clsung · Pull Request #257 · openabdev/openab

clsung · 2026-04-12T09:54:18Z

Refactor SessionPool to move expensive AcpConnection initialization outside of the write lock. This ensures that a single slow or stuck session startup does not block access to other active sessions in the pool.

Changes:

Wrap AcpConnection in Arc<Mutex<...>> for granular per-session locking.
Switch SessionPool to use read locks for connection retrieval.
Perform connection spawning and initialization before acquiring the pool's write lock.
Use a fast-path read lock to check session health before attempting a rebuild.
fixes bug: SessionPool lock scope crosses await and can stall prompt handling #256

Discord Discussion URL: https://discord.com/channels/1491295327620169908/1491365150664560881/1493769615984164975

clsung · 2026-04-15T00:24:25Z

Rebased this branch onto current openabdev/openab main and ported the fix onto the post-#310 PoolState design.

What changed:

moved per-session access to Arc<tokio::sync::Mutex<AcpConnection>>
dropped the global pool/state lock before long async work in get_or_create()
changed with_connection() to clone the per-connection handle under a read lock, then lock only that session
kept #310 suspend/resume behavior instead of reverting it
kept cleanup non-blocking by probing per-connection locks outside the global state lock
fixed two ABA-style races in eviction/cleanup by removing entries only when the sampled Arc handle still matches the current map entry

Validation:

cargo test passes on the rebased branch
added focused unit tests for the identity-checked removal helper
current result: 42 passed; 0 failed

This PR now targets the lock-scope bug from #256 without discarding the merged #310 pool changes.

masami-agent

Code Review — PR #257

Thanks for the thorough refactor, @clsung. This addresses a real p1 issue — the current pool holds a write lock across AcpConnection::spawn() + initialize() + session_new(), which are all async and can take seconds. During that time, every other thread is blocked.

Architecture change

	Before	After
Connection storage	`HashMap<String, AcpConnection>`	`HashMap<String, Arc<Mutex<AcpConnection>>>`
Pool lock during spawn	Write lock held across spawn + init + session_new	Read lock for snapshot, no lock during spawn, write lock only for final insert
`with_connection`	Write lock on pool state	Read lock on pool state + per-connection mutex
Eviction	Under write lock, direct access	Snapshot + `try_lock` + `remove_if_same_handle`
Cleanup	Under write lock	Snapshot + `try_lock` + deferred removal

This is a significant improvement. The key insight — moving expensive async work outside the pool lock — is correct and well-executed.

✅ What looks good

remove_if_same_handle — Smart use of Arc::ptr_eq to prevent ABA problems. If another task replaced the connection between snapshot and removal, the stale handle is not removed. Well-tested with two unit tests.
with_connection now uses read lock — This is the biggest win. Previously, a streaming prompt held a write lock on the entire pool. Now it only holds a per-connection mutex, so other threads are unblocked.
cleanup_idle uses try_lock — Busy connections are skipped instead of blocking the cleanup sweep. Correct behavior: a busy session is by definition not idle.
Race condition handling in get_or_create — After spawning outside the lock, re-checks if another task already created a healthy connection for the same thread. Good.
HRTB fix on with_connection — for<'a> FnOnce(&'a mut AcpConnection) is the correct lifetime bound for the new pattern where the mutex guard (not the pool state) owns the borrow.

🔴 Must fix before merge

1. Eviction can fail silently, leaving pool permanently full

if state.active.len() >= self.max_sessions {
    if let Some((key, expected_conn, _, sid)) = eviction_candidate {
        if remove_if_same_handle(&mut state.active, &key, &expected_conn).is_some() {
            // ...
        }
    }
}

if state.active.len() >= self.max_sessions {
    return Err(anyhow!("pool exhausted ({} sessions)"));
}

If eviction_candidate is None (all other connections were locked via try_lock) or remove_if_same_handle returns None (connection was replaced), the pool stays full and returns an error. This is correct as a safety net, but the original code would always find an oldest entry because it had direct access under the write lock.

In the new design, under high concurrency, try_lock could fail on all connections, making eviction impossible even though idle sessions exist. Consider falling back to lock().await on the oldest candidate if try_lock fails on all entries, or at least logging a warning when eviction fails due to lock contention.

2. saved_session_id from suspended map is read under read lock but removed under write lock later — potential double-resume

let (existing, saved_session_id) = {
    let state = self.state.read().await;
    (
        state.active.get(thread_id).cloned(),
        state.suspended.get(thread_id).cloned(),  // read, not removed
    )
};

The suspended session ID is read (cloned) but not removed from the map. Later, state.suspended.remove(thread_id) happens under the write lock. But between the read and the write, another task for the same thread could also read the same suspended ID and attempt session_load with it. Two connections could try to resume the same session.

This is unlikely (same thread_id racing) but possible if a user sends two messages rapidly in the same thread. Consider either:

Accepting this as a known edge case (session_load is idempotent-ish)
Or adding a comment documenting the race window

🟡 Non-blocking

3. had_existing flag for session_reset

let had_existing = existing.is_some();
// ...
if !resumed {
    new_conn.session_new(&self.config.working_dir).await?;
    if had_existing || saved_session_id.is_some() {
        new_conn.session_reset = true;
    }
}

The original code only set session_reset when saved_session_id.is_some(). This PR adds had_existing as an additional condition. This is a behavioral change — if a stale connection existed but had no suspended session, session_reset is now set. Is this intentional? If so, a comment explaining why would help.

4. suspend_entry function removed

The suspend_entry helper is removed and its logic is inlined in multiple places. This is fine for now, but if more callers need suspension logic, consider extracting a helper again.

5. Connection drop timing

In the old code, suspend_entry explicitly drops the connection (triggering process group kill) inside the write lock. In the new code, the old Arc<Mutex<AcpConnection>> is dropped when remove_if_same_handle returns it and it goes out of scope. This should be fine since the Arc refcount will reach zero, but worth verifying that no other clone holds a reference that would delay the drop.

📝 Summary

Priority	Item	Status
🔴	Eviction can fail silently under lock contention	Needs fix or documented fallback
🔴	Potential double-resume race on suspended session	Needs comment or fix
🟡	`had_existing` behavioral change	Needs clarification
🟡	`suspend_entry` inlined	Non-blocking
🟡	Connection drop timing	Non-blocking

The core design is sound and addresses the p1 issue well. The two 🔴 items are edge cases but worth addressing before merge given this is the session pool — the most critical concurrency component in the system.

Looking forward to the next revision!

clsung · 2026-04-15T06:22:36Z

Addressed the review follow-up on the latest revision:

#1 full-pool eviction under contention: kept the safe pool exhausted fallback, and now log explicit warnings when eviction fails because the sampled candidate changed or because all other sessions were busy during the eviction scan
#2 double-resume window: added a per-thread creation gate (creating) so same-thread get_or_create() / resume work is serialized and two rapid requests cannot race into duplicate session/load attempts
#3 session_reset behavior: added an inline comment documenting why stale live entries without a resumable session id still set the reset banner

Validation:

added a unit test for the per-thread gate helper
cargo test passes: 43 passed; 0 failed

Latest commit: 3fc2377

masami-agent

Re-review — all items addressed ✅

@clsung addressed all three review items cleanly. Verified the full diff (not just the items I raised).

🔴 #1 — Eviction under contention

Kept pool exhausted as the safe fallback (correct — better to reject than corrupt state). Added two explicit warning logs:

"pool full but eviction candidate changed before removal" — ABA case
"pool full but all other sessions were busy during eviction scan" — all try_lock failed

Operators can now diagnose eviction failures from logs. Acceptable approach.

🔴 #2 — Double-resume race

Added a per-thread creation gate (creating: HashMap<String, Arc<Mutex<()>>>). The _create_guard at the top of get_or_create serializes same-thread calls, completely eliminating the race window. The write lock for get_or_insert_gate is brief (just a HashMap insert) and released before any heavy async work. Clean solution.

Unit test get_or_insert_gate_reuses_gate_for_same_thread confirms gate reuse via Arc::ptr_eq.

🟡 #3 — `session_reset` behavioral change

Inline comment added explaining the intent — stale live entries without a resumable session ID still warrant the reset banner. Makes sense.

Full diff verification

✅ Only src/acp/pool.rs changed — no Cargo.toml/lock version issues
✅ Lock ordering is correct: write lock for gate insert → release → per-thread gate → read lock for snapshot → no lock during spawn/init → write lock for final insert
✅ cleanup_idle and with_connection patterns are consistent with get_or_create
✅ 3 unit tests covering remove_if_same_handle (match + mismatch) and get_or_insert_gate
✅ cargo test: 43 passed, 0 failed

Recommend squash merge.

masami-agent · 2026-04-15T06:34:35Z

@clsung Two items to fix before this can merge:

1. CI fail — clippy type_complexity

error: very complex type used
--> src/acp/pool.rs:103
|   let mut eviction_candidate: Option<(String, Arc<Mutex<AcpConnection>>, Instant, Option<String>)> = None;

Simplest fix: add a type alias, e.g.

type EvictionCandidate = (String, Arc<Mutex<AcpConnection>>, Instant, Option<String>);

then use Option<EvictionCandidate>.

2. Version regression — Cargo.toml
PR branch has version = "0.7.4" but main is already at 0.7.5. Please rebase or update to match main.

clsung · 2026-04-15T06:46:21Z

Addressed the two latest maintainer follow-ups:

fixed the clippy type_complexity complaint in src/acp/pool.rs by introducing an EvictionCandidate type alias
synced the crate version to current main (0.7.5) in Cargo.toml and the openab package entry in Cargo.lock

Validation:

cargo test passes: 43 passed; 0 failed
cargo clippy --all-targets -- -D warnings passes

Latest commit: deafa0e

masami-agent · 2026-04-16T08:19:46Z

@clsung Two items before we can merge:

Merge conflict — please rebase onto latest main
Version regression — PR branch is at 0.7.5 but main is now 0.7.7. Please sync the version after rebase.

obrutjack · 2026-04-16T08:24:39Z

Code logic reviewed and looks solid — Arc<Mutex<AcpConnection>> per-connection locking, creating gate for same-thread serialization, and remove_if_same_handle with Arc::ptr_eq are all correct patterns.

Two blockers before I can approve:

Rebase onto main — currently has merge conflicts
Sync Cargo.toml version — PR is at 0.7.5, main is 0.7.7

Once rebased with CI green and no version regression, I'll approve and we can proceed to merge.

…enabdev#256) Refactor SessionPool to move expensive AcpConnection initialization outside of the write lock. This ensures that a single slow or stuck session startup does not block access to other active sessions in the pool. Changes: - Wrap AcpConnection in Arc<Mutex<...>> for granular per-session locking. - Switch SessionPool to use read locks for connection retrieval. - Perform connection spawning and initialization before acquiring the pool's write lock. - Use a fast-path read lock to check session health before attempting a rebuild.

clsung · 2026-04-16T09:12:52Z

Rebased bugfix/session_pool onto latest main and force-pushed the branch.

Current branch state:

based on latest canonical main
crate version synced to 0.7.7
latest commit: e10f3d2

Validation after rebase:

cargo test passes: 43 passed; 0 failed
cargo clippy --all-targets -- -D warnings passes

This should clear the remaining merge-conflict and version-regression blockers.

obrutjack

Merge checklist verified:

✅ CI all green (cargo check + clippy + 7 Docker smoke tests)
✅ No merge conflicts
✅ No version regression (0.7.7 matches main)
✅ Clippy fixed (type alias for EvictionCandidate)
✅ masami-agent approved
✅ Code logic reviewed: Arc<Mutex> per-connection locking, creating gate, remove_if_same_handle — all correct

Pending code owner review from @thepagent to merge.

thepagent · 2026-04-16T15:50:05Z

Staging after next stable release.

chaodu-agent · 2026-04-17T17:00:14Z

Community Triage Notes — Edge Cases & Minor Concerns

Overall this PR is solid and correctly addresses the core lock-scope issue from #256. Recording a few edge cases and minor concerns here for future reference in case they come up.

1. `with_connection` has no timeout on per-connection lock

After dropping the pool read lock, with_connection calls conn.lock().await with no timeout. If a previous task holds the connection mutex indefinitely (e.g. a stuck prompt stream — the exact scenario #256 describes), the next same-thread request will block forever at this point. The creating gate serializes the create/resume path, but with_connection contention is not yet guarded. Worth considering adding tokio::time::timeout or a try_lock + retry pattern here if same-thread hangs resurface.

2. Failed init does not clean up stale active entry

If get_or_create detects a stale connection but then fails during spawn / initialize / session_new, the stale Arc<Mutex<AcpConnection>> remains in state.active. Subsequent calls will re-detect it as stale and retry, but the dead entry occupies a pool slot in the meantime. Consider removing the stale entry before attempting init, or cleaning it up in the error path.

3. `creating` gate map grows unbounded

PoolState::creating inserts an Arc<Mutex<()>> per thread_id and never removes them. Since pool size is capped this is unlikely to be a practical issue, but for long-running deployments with many distinct thread IDs, it might be worth pruning gates for threads no longer in active during cleanup_idle.

4. `session_reset` banner on never-alive connections

The condition had_existing || saved_session_id.is_some() will set session_reset = true even when the existing entry was never alive and has no resumable session ID. The false-positive reset banner is harmless but slightly imprecise semantically.

None of these are blockers — just documenting them so we have a reference if any of these edge cases surface in production.

clsung requested a review from thepagent as a code owner April 12, 2026 09:54

clsung force-pushed the bugfix/session_pool branch from 32087ea to 019a125 Compare April 15, 2026 00:19

github-actions bot added the closing-soon PR missing Discord Discussion URL — will auto-close in 3 days label Apr 15, 2026

thepagent mentioned this pull request Apr 15, 2026

refactor(pool): per-connection Arc<Mutex> to unblock concurrent sessions #183

Closed

thepagent added the p1 High — address this sprint label Apr 15, 2026

github-actions bot removed the closing-soon PR missing Discord Discussion URL — will auto-close in 3 days label Apr 15, 2026

masami-agent reviewed Apr 15, 2026

View reviewed changes

masami-agent approved these changes Apr 15, 2026

View reviewed changes

clsung added 4 commits April 16, 2026 17:02

fix: guard session pool handle replacement races

145971f

fix: serialize same-thread session rebuilds

ff56ed1

fix: satisfy clippy and sync crate version

e10f3d2

clsung force-pushed the bugfix/session_pool branch from deafa0e to e10f3d2 Compare April 16, 2026 09:06

obrutjack approved these changes Apr 16, 2026

View reviewed changes

thepagent approved these changes Apr 17, 2026

View reviewed changes

thepagent merged commit dbfa231 into openabdev:main Apr 17, 2026
9 checks passed

thepagent mentioned this pull request Apr 17, 2026

fix(discord): handle_message blocks event loop during streaming, starving other sessions #429

Closed

ruan330 mentioned this pull request Apr 18, 2026

fix: pool write lock held during entire prompt streaming causes cross-session deadlock #58

Closed

obrutjack mentioned this pull request Apr 19, 2026

bug: with_connection global write lock + unclosed notify channel freezes bot when ACP process goes stale #295

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: avoid blocking unrelated sessions during pool initialization (#256)#257

fix: avoid blocking unrelated sessions during pool initialization (#256)#257
thepagent merged 4 commits intoopenabdev:mainfrom
clsung:bugfix/session_pool

clsung commented Apr 12, 2026 •

edited

Loading

Uh oh!

clsung commented Apr 15, 2026

Uh oh!

masami-agent left a comment

Uh oh!

clsung commented Apr 15, 2026

Uh oh!

masami-agent left a comment

Uh oh!

masami-agent commented Apr 15, 2026

Uh oh!

clsung commented Apr 15, 2026

Uh oh!

masami-agent commented Apr 16, 2026

Uh oh!

obrutjack commented Apr 16, 2026

Uh oh!

clsung commented Apr 16, 2026

Uh oh!

obrutjack left a comment

Uh oh!

thepagent commented Apr 16, 2026

Uh oh!

chaodu-agent commented Apr 17, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Conversation

clsung commented Apr 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

clsung commented Apr 15, 2026

Uh oh!

masami-agent left a comment

Choose a reason for hiding this comment

Code Review — PR #257

Architecture change

✅ What looks good

🔴 Must fix before merge

🟡 Non-blocking

📝 Summary

Uh oh!

clsung commented Apr 15, 2026

Uh oh!

masami-agent left a comment

Choose a reason for hiding this comment

Re-review — all items addressed ✅

🔴 #1 — Eviction under contention

🔴 #2 — Double-resume race

🟡 #3 — session_reset behavioral change

Full diff verification

Uh oh!

masami-agent commented Apr 15, 2026

Uh oh!

clsung commented Apr 15, 2026

Uh oh!

masami-agent commented Apr 16, 2026

Uh oh!

obrutjack commented Apr 16, 2026

Uh oh!

clsung commented Apr 16, 2026

Uh oh!

obrutjack left a comment

Choose a reason for hiding this comment

Uh oh!

thepagent commented Apr 16, 2026

Uh oh!

chaodu-agent commented Apr 17, 2026

Community Triage Notes — Edge Cases & Minor Concerns

1. with_connection has no timeout on per-connection lock

2. Failed init does not clean up stale active entry

3. creating gate map grows unbounded

4. session_reset banner on never-alive connections

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

clsung commented Apr 12, 2026 •

edited

Loading

🟡 #3 — `session_reset` behavioral change

1. `with_connection` has no timeout on per-connection lock

3. `creating` gate map grows unbounded

4. `session_reset` banner on never-alive connections